Case Study: Factors Influencing Life Expectancy using Linear Regression

Context:

  • Many studies on factors affecting life expectancy have been undertaken in the past, considering demographic variables, income composition, and mortality rates.
  • However, the effects of immunization and the Human Development Index were not taken into account in those studies.
  • Important immunizations such as Hepatitis B, Polio, and Diphtheria will also be considered. In a nutshell, this case study covers immunization factors, mortality factors, economic factors, social factors, and other health-related factors.
  • We will use linear regression to examine the effect of these factors on life expectancy.

Problem:

The dataset aims to answer the following key questions:

  • Do the various predicting factors really affect life expectancy?
  • Which predicting variables actually affect life expectancy?
  • Should a country with a lower life expectancy increase its healthcare expenditure in order to improve its average lifespan?
  • Do infant and adult mortality rates affect life expectancy?
  • Does life expectancy have a positive or negative correlation with a country's status (developing or developed), lifestyle, GDP, etc.?
  • What is the impact of schooling on the lifespan of humans?
  • Does life expectancy have a positive or negative relationship with alcohol consumption?
  • What is the impact of immunization coverage (for diseases such as Measles and Hepatitis B) on life expectancy?

Attribute Information:

  • Country: Country
  • Year: Year
  • Status: Developed or Developing status
  • Life expectancy: Life expectancy (in years)
  • Adult Mortality: Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
  • infant deaths: Number of Infant Deaths per 1000 population
  • Alcohol: Recorded per capita (15+) alcohol consumption (in litres of pure alcohol)
  • percentage expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita (%)
  • Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
  • Measles: Measles - number of reported cases per 1000 population
  • BMI: Average Body Mass Index of entire population
  • under-five deaths: Number of under-five deaths per 1000 population
  • Polio: Polio (Pol3) immunization coverage among 1-year-olds (%)
  • Total expenditure: General government expenditure on health as a percentage of total government expenditure (%)
  • Diphtheria: Diphtheria-tetanus-toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
  • HIV/AIDS: Deaths per 1,000 live births due to HIV/AIDS (0-4 years)
  • GDP: Gross Domestic Product per capita (in USD)
  • Population: Population of the country
  • thinness 1-19 years: Prevalence of thinness among children and adolescents for ages 10 to 19 (%)
  • thinness 5-9 years: Prevalence of thinness among children for ages 5 to 9 (%)
  • Income composition of resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
  • Schooling: Number of years of Schooling(years)

Import libraries

In [1]:
# Import necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# To enable plotting graphs in Jupyter notebook
%matplotlib inline

Load and explore the data

In [2]:
# Load the data into pandas dataframe
data = pd.read_csv('Life Expectancy Data.csv')              # Make changes to the path depending on where your data file is stored.
In [3]:
data.head()
Out[3]:
Country Year Status Life expectancy Adult Mortality Infant deaths Alcohol Percentage expenditure Hepatitis B Measles ... Polio Total expenditure Diphtheria HIV/AIDS GDP Population Thinness 1-19 years Thinness 5-9 years Income composition of resources Schooling
0 Afghanistan 2015 Developing 65.0 263.0 62 0.01 71.279624 65.0 1154 ... 6.0 8.16 65.0 0.1 584.259210 33736494.0 17.2 17.3 0.479 10.1
1 Afghanistan 2014 Developing 59.9 271.0 64 0.01 73.523582 62.0 492 ... 58.0 8.18 62.0 0.1 612.696514 327582.0 17.5 17.5 0.476 10.0
2 Afghanistan 2013 Developing 59.9 268.0 66 0.01 73.219243 64.0 430 ... 62.0 8.13 64.0 0.1 631.744976 31731688.0 17.7 17.7 0.470 9.9
3 Afghanistan 2012 Developing 59.5 272.0 69 0.01 78.184215 67.0 2787 ... 67.0 8.52 67.0 0.1 669.959000 3696958.0 17.9 18.0 0.463 9.8
4 Afghanistan 2011 Developing 59.2 275.0 71 0.01 7.097109 68.0 3013 ... 68.0 7.87 68.0 0.1 63.537231 2978599.0 18.2 18.2 0.454 9.5

5 rows × 22 columns

In [4]:
# Check number of rows and columns
data.shape
Out[4]:
(2938, 22)
In [5]:
# Have a look at the column names
data.columns
Out[5]:
Index(['Country', 'Year', 'Status', 'Life expectancy', 'Adult Mortality',
       'Infant deaths', 'Alcohol', 'Percentage expenditure', 'Hepatitis B',
       'Measles', 'BMI', 'Under-five deaths', 'Polio', 'Total expenditure',
       'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'Thinness  1-19 years',
       'Thinness 5-9 years', 'Income composition of resources', 'Schooling'],
      dtype='object')
In [6]:
# Check column types and missing values
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   Infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   Percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10  BMI                              2904 non-null   float64
 11  Under-five deaths                2938 non-null   int64  
 12  Polio                            2919 non-null   float64
 13  Total expenditure                2712 non-null   float64
 14  Diphtheria                       2919 non-null   float64
 15  HIV/AIDS                         2938 non-null   float64
 16  GDP                              2490 non-null   float64
 17  Population                       2286 non-null   float64
 18  Thinness  1-19 years             2904 non-null   float64
 19  Thinness 5-9 years               2904 non-null   float64
 20  Income composition of resources  2771 non-null   float64
 21  Schooling                        2775 non-null   float64
dtypes: float64(16), int64(4), object(2)
memory usage: 505.1+ KB
In [7]:
# remove the rows of data which have missing value(s)
data = data.dropna()
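Note that listwise deletion is aggressive here: `dropna()` shrinks the dataset from 2938 rows to 1649 (as the shapes later confirm), discarding roughly 44% of the data. A minimal sketch of how a single missing value removes an entire row (toy frame, not the case-study data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the life-expectancy data (assumption)
df = pd.DataFrame({'GDP': [584.3, np.nan, 631.7],
                   'Schooling': [10.1, 10.0, np.nan]})

# Listwise deletion: any NaN anywhere in the row drops the whole row
n_before = len(df)
n_after = len(df.dropna())
# Here only the first row survives: n_before == 3, n_after == 1
```

Imputation (e.g. per-country means) would retain more rows, at the cost of extra assumptions.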
In [8]:
# Check the unique values in each column of the dataframe.
data.nunique()
Out[8]:
Country                             133
Year                                 16
Status                                2
Life expectancy                     320
Adult Mortality                     369
Infant deaths                       165
Alcohol                             833
Percentage expenditure             1645
Hepatitis B                          83
Measles                             603
BMI                                 538
Under-five deaths                   199
Polio                                68
Total expenditure                   669
Diphtheria                           66
HIV/AIDS                            167
GDP                                1649
Population                         1647
Thinness  1-19 years                179
Thinness 5-9 years                  185
Income composition of resources     548
Schooling                           147
dtype: int64

Insights:

  • The "Status" column has 2 unique values: "Developing" and "Developed".
  • The "Country" column has 133 unique values, i.e. the data was collected from 133 countries.
In [9]:
plt.figure(figsize=(10,7))
plt.scatter(data['Schooling'], data['Life expectancy'], color='red')
plt.title('Life expectancy Vs Schooling', fontsize=14)
plt.xlabel('Schooling', fontsize=14)
plt.ylabel('Life expectancy', fontsize=14)
plt.grid(True)
plt.show()

Insights:

  • The plot shows clearly that as the number of years of schooling increases, life expectancy also increases.
  • The relationship appears approximately linear.
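The visual trend can be quantified with a Pearson correlation coefficient; a minimal sketch on made-up schooling/life-expectancy pairs (not the actual data):

```python
import numpy as np

# Toy schooling / life-expectancy pairs (assumption: not the case-study data)
schooling = np.array([5.0, 8.0, 10.0, 12.0, 15.0])
life_exp = np.array([55.0, 62.0, 66.0, 70.0, 76.0])

# Pearson r quantifies the strength of the linear trend (here close to +1)
r = np.corrcoef(schooling, life_exp)[0, 1]
```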

Features vs. Target

In [10]:
plt.figure(figsize=(10,7))
plt.scatter(data['Measles'], data['Life expectancy'], color='red')
plt.title('Life expectancy Vs Measles ', fontsize=14)
plt.xlabel('Measles', fontsize=14)
plt.ylabel('Life expectancy', fontsize=14)
plt.grid(True)
plt.show()
In [11]:
plt.figure(figsize=(10,7))
plt.scatter(data['Alcohol'], data['Life expectancy'], color='red')
plt.title('Life expectancy Vs Alcohol', fontsize=14)
plt.xlabel('Alcohol', fontsize=14)
plt.ylabel('Life expectancy', fontsize=14)
plt.grid(True)
plt.show()
In [12]:
plt.figure(figsize=(10,7))
plt.scatter(data['Total expenditure'], data['Life expectancy'], color='red')
plt.title('Life expectancy Vs Total expenditure', fontsize=14)
plt.xlabel('Total expenditure', fontsize=14)
plt.ylabel('Life expectancy', fontsize=14)
plt.grid(True)
plt.show()
In [13]:
plt.figure(figsize=(10,7))
plt.scatter(data['Adult Mortality'], data['Life expectancy'], color='red')
plt.title('Life expectancy Vs Adult Mortality', fontsize=14)
plt.xlabel('Adult Mortality', fontsize=14)
plt.ylabel('Life expectancy', fontsize=14)
plt.grid(True)
plt.show()

Insights:

  • Here we see that as Adult Mortality increases, life expectancy decreases.
  • So there is an inverse relationship between this feature on the X-axis and the target variable on the Y-axis.
  • Even though the relationship is inverse, it is still approximately linear, as can be seen from the graph.

Conclusion:

  • Yes, from the above analysis we can conclude that the dataset is a good fit for the application of linear regression.
In [14]:
plt.figure(figsize=(10,7))
plt.scatter(data['Life expectancy'][:200], data['Country'][:200], color='red')
plt.title('Country Vs Life expectancy', fontsize=14)
plt.xlabel('Life expectancy', fontsize=14)
plt.ylabel('Country', fontsize=14)
plt.grid(True)
plt.show()

Insights:

  • Life expectancy by country shows a wide spread: some countries have a higher life expectancy and some a lower one. The graph plots 200 data points drawn from 15 countries (the country names are on the Y-axis).
  • We can see that the red dots lie more to the right (near 90 years) for some countries and to the left (below 50 years) for others.
In [15]:
sns.pairplot(data, height=3, diag_kind='auto', corner=True)
plt.show()

Insights:

  • From the 2nd and 3rd columns of the grid we can see that these features have a linear (or inversely linear) relationship with the target.
In [16]:
plt.figure(figsize=(10,10))
sns.boxplot(y=data['Life expectancy'])        # Vertical boxplot; the positional-argument form is deprecated in newer seaborn
plt.show()

Insights:

  • The bottom whisker (lowest black horizontal line) marks the minimum non-outlier value, around 48 years.
  • The bottom edge of the blue box is the first quartile (25th percentile), at about 64 years.
  • The black line inside the box is the second quartile (median, 50th percentile), at about 72 years.
  • The top edge of the box is the third quartile (75th percentile), at about 75 years.
  • The top whisker marks the maximum non-outlier value, about 88 years.
  • The small diamonds below the bottom whisker are outliers (possibly erroneous data).
In [17]:
plt.figure(figsize=(8,8))
sns.boxplot(x="Status",y="Life expectancy",data=data)
plt.show()

Insights:

  • The life expectancy in developed countries is much higher than in developing countries.
  • The median value of life expectancy (approximately as can be seen from the boxplot) for:
    • Developing: 69 years
    • Developed: 78 years
In [18]:
data.corr(numeric_only=True)['Life expectancy']        # numeric_only=True is needed in newer pandas to skip the object columns
Out[18]:
Year                               0.050771
Life expectancy                    1.000000
Adult Mortality                   -0.702523
Infant deaths                     -0.169074
Alcohol                            0.402718
Percentage expenditure             0.409631
Hepatitis B                        0.199935
Measles                           -0.068881
BMI                                0.542042
Under-five deaths                 -0.192265
Polio                              0.327294
Total expenditure                  0.174718
Diphtheria                         0.341331
HIV/AIDS                          -0.592236
GDP                                0.441322
Population                        -0.022305
Thinness  1-19 years              -0.457838
Thinness 5-9 years                -0.457508
Income composition of resources    0.721083
Schooling                          0.727630
Name: Life expectancy, dtype: float64
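A common follow-up is to filter this series for the strongest predictors. A sketch on a toy series (the 0.5 threshold is an arbitrary illustrative choice, not from the case study):

```python
import pandas as pd

# Toy correlation series (assumption: stands in for data.corr()['Life expectancy'])
corr = pd.Series({'Adult Mortality': -0.70,
                  'Schooling': 0.73,
                  'Population': -0.02})

# Keep features whose absolute correlation with the target exceeds 0.5
strong = corr[corr.abs() > 0.5]
# → 'Adult Mortality' and 'Schooling' survive the filter
```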
In [19]:
plt.figure(figsize=(20,20))
sns.heatmap(data.corr(numeric_only=True), annot=True, fmt=".2f")
plt.show()
In [20]:
X = data.drop('Life expectancy', axis=1)
y = data[['Life expectancy']]

print(X.head())
print(y.head())
       Country  Year  ... Income composition of resources  Schooling
0  Afghanistan  2015  ...                           0.479       10.1
1  Afghanistan  2014  ...                           0.476       10.0
2  Afghanistan  2013  ...                           0.470        9.9
3  Afghanistan  2012  ...                           0.463        9.8
4  Afghanistan  2011  ...                           0.454        9.5

[5 rows x 21 columns]
   Life expectancy
0             65.0
1             59.9
2             59.9
3             59.5
4             59.2
In [21]:
print(X.shape)
print(y.shape)
(1649, 21)
(1649, 1)
In [22]:
X.loc[:, 'Country'] = X.loc[:, 'Country'].astype('category')
X.loc[:, 'Status'] = X.loc[:, 'Status'].astype('category')

X.loc[:, 'Country'] = X.loc[:, 'Country'].cat.codes
X.loc[:, 'Status'] = X.loc[:, 'Status'].cat.codes

X.head()
Out[22]:
Country Year Status Adult Mortality Infant deaths Alcohol Percentage expenditure Hepatitis B Measles BMI Under-five deaths Polio Total expenditure Diphtheria HIV/AIDS GDP Population Thinness 1-19 years Thinness 5-9 years Income composition of resources Schooling
0 0 2015 1 263.0 62 0.01 71.279624 65.0 1154 19.1 83 6.0 8.16 65.0 0.1 584.259210 33736494.0 17.2 17.3 0.479 10.1
1 0 2014 1 271.0 64 0.01 73.523582 62.0 492 18.6 86 58.0 8.18 62.0 0.1 612.696514 327582.0 17.5 17.5 0.476 10.0
2 0 2013 1 268.0 66 0.01 73.219243 64.0 430 18.1 89 62.0 8.13 64.0 0.1 631.744976 31731688.0 17.7 17.7 0.470 9.9
3 0 2012 1 272.0 69 0.01 78.184215 67.0 2787 17.6 93 67.0 8.52 67.0 0.1 669.959000 3696958.0 17.9 18.0 0.463 9.8
4 0 2011 1 275.0 71 0.01 7.097109 68.0 3013 17.2 97 68.0 7.87 68.0 0.1 63.537231 2978599.0 18.2 18.2 0.454 9.5
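Mapping Country and Status to integer codes imposes an arbitrary ordering on the categories, which a linear model will treat as meaningful. One-hot encoding with pd.get_dummies avoids this; a sketch on a toy frame:

```python
import pandas as pd

# Toy frame (assumption: stands in for the Status column of the case study)
df = pd.DataFrame({'Status': ['Developing', 'Developed', 'Developing']})

# One-hot encoding: one indicator column per category, no implied ordering
encoded = pd.get_dummies(df, columns=['Status'])
# → columns 'Status_Developed' and 'Status_Developing'
```

For this dataset, one-hot encoding Country would add over a hundred columns; the integer-code shortcut keeps the feature count small at the cost of a spurious ordinal assumption.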
In [23]:
X = X.values
y = y.values
In [24]:
#split the data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
In [25]:
from sklearn.linear_model import LinearRegression
linearregression = LinearRegression()                                    
linearregression.fit(X_train, y_train)                                  

print("Intercept of the linear equation:", linearregression.intercept_) 
print("\nCoefficients of the equation are:", linearregression.coef_)

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

pred = linearregression.predict(X_test)                              
Intercept of the linear equation: [278.83730444]

Coefficients of the equation are: [[-1.67819761e-04 -1.12155547e-01 -9.19361543e-01 -1.58907134e-02
   8.79604795e-02 -1.70626950e-01  3.64283517e-04 -5.92563803e-03
  -9.88604544e-06  2.47918291e-02 -6.60330507e-02  5.37760460e-03
   1.02852941e-01  1.57086062e-02 -4.35097774e-01  7.70238420e-06
   4.59904862e-10 -4.51871709e-02 -4.55238012e-02  8.96719230e+00
   1.02122297e+00]]
In [26]:
# Mean Absolute Error
mean_absolute_error(y_test, pred)
Out[26]:
2.7891955304890574

The mean absolute error (MAE) is the simplest regression error metric to understand. We’ll calculate the residual for every data point, taking only the absolute value of each so that negative and positive residuals do not cancel out. We then take the average of all these residuals. Effectively, MAE describes the typical magnitude of the residuals.
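The description above can be written out directly; a sketch with toy actual/predicted values (not the model's output):

```python
import numpy as np

# Toy actual and predicted values (assumption: not from the fitted model)
y_true = np.array([67.5, 73.8, 79.1])
y_hat = np.array([71.8, 72.5, 80.7])

# MAE: average of the absolute residuals
mae = np.mean(np.abs(y_true - y_hat))
# → 2.4 for these toy values
```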

In [27]:
# RMSE
mean_squared_error(y_test, pred)**0.5
Out[27]:
3.6440168718313846

The root mean square error (RMSE) is just like the MAE, but squares each difference instead of taking its absolute value, averages the squares, and then takes the square root of the result.
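A sketch of the RMSE computation on toy values (not the model's output):

```python
import numpy as np

# Toy actual and predicted values (assumption: not from the fitted model)
y_true = np.array([67.5, 73.8, 79.1])
y_hat = np.array([71.8, 72.5, 80.7])

# RMSE: square the residuals, average, then take the square root
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))
```

Because of the squaring, RMSE penalizes large residuals more heavily than MAE does.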

In [28]:
# R2 score
r2_score(y_test, pred)
Out[28]:
0.8319101484774947

R^2 (coefficient of determination) regression score function.

Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
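R² can be computed from the residual and total sums of squares; a sketch on toy values (not the model's output):

```python
import numpy as np

# Toy actual and predicted values (assumption: not from the fitted model)
y_true = np.array([67.5, 73.8, 79.1])
y_hat = np.array([71.8, 72.5, 80.7])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot
```

A constant model predicting `y_true.mean()` would make `ss_res == ss_tot`, giving the R² of 0.0 mentioned above.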

In [29]:
# Training Score
linearregression.score(X_train, y_train)
Out[29]:
0.8392652670918924
In [30]:
# Testing score
linearregression.score(X_test, y_test)
Out[30]:
0.8319101484774947

X_test Vs. Predicted values:

In [31]:
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': pred.flatten()})
df
Out[31]:
Actual Predicted
0 67.5 71.798448
1 73.8 72.533946
2 79.1 80.672907
3 54.9 54.442474
4 48.6 52.130992
... ... ...
490 64.8 69.560705
491 71.4 71.494676
492 77.2 74.751983
493 78.6 76.076827
494 68.0 68.252254

495 rows × 2 columns

  • We can also visualize the comparison as a bar graph using the script below:

  • Note: since the number of records is large, only the first 25 records are plotted for readability.

In [32]:
df1 = df.head(25)
df1.plot(kind='bar',figsize=(16,10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()
  • We can observe that our model returns pretty good predictions.
  • The actual and predicted values are close to each other.

LinearRegression model:

In [33]:
linearregression_ = LinearRegression(normalize=True)                            # Linear Regression with hyperparameter normalize=True (removed in scikit-learn >= 1.2)
linearregression_.fit(X_train, y_train)                                  

print("Intercept of the linear equation:", linearregression_.intercept_) 
print("\nCoefficients of the equation are:", linearregression_.coef_)

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

pred = linearregression_.predict(X_test)    
Intercept of the linear equation: [278.83730444]

Coefficients of the equation are: [[-1.67819762e-04 -1.12155547e-01 -9.19361543e-01 -1.58907134e-02
   8.79604795e-02 -1.70626950e-01  3.64283517e-04 -5.92563803e-03
  -9.88604545e-06  2.47918291e-02 -6.60330507e-02  5.37760460e-03
   1.02852941e-01  1.57086062e-02 -4.35097774e-01  7.70238420e-06
   4.59904864e-10 -4.51871709e-02 -4.55238012e-02  8.96719230e+00
   1.02122297e+00]]
In [34]:
# Mean Absolute Error
mean_absolute_error(y_test, pred)
Out[34]:
2.789195530493078
In [35]:
# RMSE
mean_squared_error(y_test, pred)**0.5
Out[35]:
3.644016871836159
In [36]:
# R2 score
r2_score(y_test, pred)
Out[36]:
0.8319101484770541
In [37]:
# Training Score
linearregression_.score(X_train, y_train)
Out[37]:
0.8392652670918922
In [38]:
# Testing score
linearregression_.score(X_test, y_test)
Out[38]:
0.8319101484770541
  • The training and testing scores are both around 83% and comparable, hence the model is a good fit (no sign of overfitting).

  • The R2 score of 0.83 means the model explains 83% of the total variation in the test data, so overall the model is satisfactory.

  • Setting normalize=True has no visible impact here: scikit-learn reports the coefficients on the original feature scale, so the coefficients, intercept, and scores are all unchanged (up to numerical precision) compared to the unnormalized fit, and the best-fit line is the same.
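Note that the normalize parameter was deprecated and then removed from LinearRegression in scikit-learn 1.2; the recommended replacement is to standardize explicitly, e.g. with a StandardScaler pipeline. A sketch on synthetic data (X_demo/y_demo are made up, not the case-study arrays):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X_train / y_train (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 5.0

# Standardize features, then fit; replaces LinearRegression(normalize=True)
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_demo, y_demo)
```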

In [39]:
import statsmodels.api as sm

X = sm.add_constant(X)

linearmodel = sm.OLS(y, X).fit()

predictions = linearmodel.predict(X) 

print_model = linearmodel.summary()
print(print_model)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.839
Model:                            OLS   Adj. R-squared:                  0.837
Method:                 Least Squares   F-statistic:                     402.6
Date:                Thu, 30 Jul 2020   Prob (F-statistic):               0.00
Time:                        10:27:19   Log-Likelihood:                -4421.1
No. Observations:                1649   AIC:                             8886.
Df Residuals:                    1627   BIC:                             9005.
Df Model:                          21                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        308.5849     46.228      6.675      0.000     217.911     399.258
x1             0.0012      0.002      0.504      0.614      -0.003       0.006
x2            -0.1270      0.023     -5.500      0.000      -0.172      -0.082
x3            -0.8838      0.335     -2.635      0.008      -1.542      -0.226
x4            -0.0162      0.001    -17.167      0.000      -0.018      -0.014
x5             0.0891      0.011      8.389      0.000       0.068       0.110
x6            -0.1294      0.034     -3.820      0.000      -0.196      -0.063
x7             0.0003      0.000      1.706      0.088   -4.58e-05       0.001
x8            -0.0033      0.004     -0.745      0.456      -0.012       0.005
x9         -1.025e-05   1.07e-05     -0.958      0.338   -3.12e-05    1.07e-05
x10            0.0318      0.006      5.340      0.000       0.020       0.043
x11           -0.0669      0.008     -8.694      0.000      -0.082      -0.052
x12            0.0056      0.005      1.099      0.272      -0.004       0.016
x13            0.0926      0.040      2.289      0.022       0.013       0.172
x14            0.0141      0.006      2.404      0.016       0.003       0.026
x15           -0.4492      0.018    -25.015      0.000      -0.484      -0.414
x16         2.445e-05   2.83e-05      0.865      0.387    -3.1e-05    7.99e-05
x17        -6.265e-10   1.73e-09     -0.361      0.718   -4.03e-09    2.77e-09
x18           -0.0029      0.053     -0.054      0.957      -0.107       0.101
x19           -0.0521      0.052     -1.001      0.317      -0.154       0.050
x20           10.4554      0.833     12.552      0.000       8.822      12.089
x21            0.8937      0.059     15.106      0.000       0.778       1.010
==============================================================================
Omnibus:                       32.182   Durbin-Watson:                   0.708
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               59.063
Skew:                          -0.106   Prob(JB):                     1.49e-13
Kurtosis:                       3.903   Cond. No.                     3.80e+10
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.8e+10. This might indicate that there are
strong multicollinearity or other numerical problems.
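The large condition number flagged above suggests multicollinearity, which can be checked per feature with variance inflation factors (VIF) from statsmodels. A sketch on synthetic data where one column nearly duplicates another (the data is made up):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic design matrix (assumption): x2 is almost a copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
x3 = rng.normal(size=200)
X_demo = np.column_stack([np.ones(200), x1, x2, x3])  # constant in column 0

# VIF per non-constant column: large values flag collinear features
vifs = [variance_inflation_factor(X_demo, i) for i in range(1, X_demo.shape[1])]
```

A common rule of thumb treats VIF above 5-10 as problematic; here the two near-duplicate columns get very large VIFs while the independent one stays near 1.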

Interpreting the Regression Results:

  1. Adjusted R-squared:
    • The value for Adj. R-squared is 0.837, which is good!
  2. const coefficient:
    • If all the independent variables (features such as Country, Status, Adult Mortality, and so on) are zero, then the expected output (i.e., Y) equals the const coefficient.
    • In our case, the const coefficient is 308.5849.
  3. Status coefficient: it represents the change in the output Y due to a one-unit change in Status (everything else held constant).
  4. Schooling coefficient: it represents the change in the output Y due to a one-unit change in Schooling (everything else held constant).
  5. P>|t|:
    • A p-value of less than 0.05 is conventionally considered statistically significant.
  6. Confidence interval: it represents the range in which the true coefficient is likely to fall (with 95% confidence).
In [40]:
# Residuals (actual - predicted) vs. fitted values
plt.figure(figsize=(10,8))
plt.scatter(linearmodel.predict(), linearmodel.resid, marker='*')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
In [41]:
# Error (residual) distribution
plt.figure(figsize=(10,8))
sns.histplot(linearmodel.resid, color='red')        # distplot is deprecated in newer seaborn; histplot is the replacement
plt.show()